You've been offered an internship in the analytical department at Yandex.Afisha. Your first task is to help optimize marketing expenses. You have:
You are going to study:
import math
import pandas as pd
from scipy import stats as st
import numpy as np
import matplotlib.pyplot as plt
import warnings
import plotly.express as px
import seaborn as sns
import matplotlib.colors as mcolors
import matplotlib.dates as mdates
warnings.filterwarnings("ignore")
%matplotlib inline
local_path1 = '/Users/yoni/Pictures/jupyter/Business Analytics/visits_log_us.csv'
local_path2 = '/Users/yoni/Pictures/jupyter/Business Analytics/orders_log_us.csv'
local_path3 = '/Users/yoni/Pictures/jupyter/Business Analytics/costs_us.csv'
visits = pd.read_csv(local_path1,
dtype={'Device': 'category','Source Id':'category'},
parse_dates=['Start Ts', 'End Ts'])
orders = pd.read_csv(local_path2,
parse_dates=['Buy Ts'])
costs = pd.read_csv(local_path3,
dtype={'source_id':'category'},
parse_dates=['dt'])
visits.info(memory_usage='deep')
orders.info(memory_usage='deep')
costs.info(memory_usage='deep')
visits.columns=visits.columns.str.lower() #changing the column names to lowercase
visits.head(10)
visits.tail(10)
# Changing the names of the columns
visits = visits.rename(columns={'end ts': 'end_ts','source id':'source_id','start ts':'start_ts'})
print('% Of values in Device\n\n{}'.format(visits.device.value_counts()/len(visits)))
print('% Of values in Source Id\n\n{}'.format(visits['source_id'].value_counts()/len(visits)))
visits[visits.duplicated(['end_ts','start_ts','uid'],keep=False)] #Looking for duplicates
In the 'visits' data frame there are no duplicates and no missing values. We can see that users logged in far more from a desktop device (73%) than from a touch screen (24%).
orders.columns=orders.columns.str.lower() #changing the column names to lowercase
orders = orders.rename(columns={'buy ts': 'buy_ts'}) # Changing the names of the columns
orders.head(10)
orders.tail(10)
orders.describe()
We can see there are no missing values or duplicates in the 'orders' df.
The minimum revenue is 0, which we should look into: it isn't really possible to buy something for free, so these rows are suspect.
orders[orders.duplicated(['buy_ts','uid'],keep=False)]
len(orders.query('revenue == 0'))
There are 51 rows with revenue equal to 0; I will delete these rows.
orders_clean = orders[orders['revenue'] != 0]
orders_clean.reset_index(drop=True,inplace=True)
len(orders_clean.query('revenue == 0'))
orders_clean.tail() # new df with no '0' in revenue
costs.head(10)
costs.tail(10)
costs.describe()
costs[costs.duplicated(keep=False)]
The 'costs' df looks good to go.
For this task I will calculate DAU, WAU, and MAU: the number of unique active users per day, week, and month.
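Before running this on the real data, here is a minimal sketch of the DAU/MAU idea on toy data (the column names 'uid' and 'start_ts' mirror the real dataset; the values are made up): count unique user ids per day and per month.

```python
import pandas as pd

# Toy visit log: three users across two months (hypothetical values).
toy = pd.DataFrame({
    'uid': [1, 1, 2, 3, 3, 3],
    'start_ts': pd.to_datetime([
        '2017-06-01 10:00', '2017-06-01 12:00', '2017-06-01 09:00',
        '2017-06-02 08:00', '2017-06-08 11:00', '2017-07-01 15:00']),
})
toy['date'] = toy['start_ts'].dt.date
toy['month'] = toy['start_ts'].dt.to_period('M')

# DAU / MAU = number of *unique* active users per day / month;
# nunique makes sure repeat sessions by the same user count once.
dau = toy.groupby('date')['uid'].nunique()
mau = toy.groupby('month')['uid'].nunique()
print(dau.tolist(), mau.tolist())
```

The `nunique` aggregation is the key detail: a plain `count` would give sessions, not users.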
visits.head(5)
visits['session_year'] = visits['start_ts'].dt.year
visits['session_month'] = visits['start_ts'].dt.month
visits['session_week'] = visits['start_ts'].dt.week
visits['session_date'] = visits['start_ts'].dt.date
visits['month_year'] = pd.to_datetime(visits['start_ts']).dt.to_period('M')
visits['session_date']=pd.to_datetime(visits['session_date'])
visits.head(5)
# MAU
mau_total = visits.groupby(['month_year'],as_index=False).agg({'uid': 'nunique'})
mau_total
# WAU
wau_total = visits.groupby(['session_year', 'session_week'],as_index=False).agg({'uid': 'nunique'})
wau_total.head()
# DAU
dau_total = visits.groupby(['session_date'],as_index=False).agg({'uid': 'nunique'})
dau_total.head()
plt.figure(figsize=(15,8))
ax=sns.barplot(x='month_year', y='uid', data=mau_total)
ax.set_title('Number Of Active Unique Users Each Month', fontdict={'size':15})
for p in ax.patches:
ax.annotate(format(p.get_height(), '.0f'),
(p.get_x() + p.get_width() / 2.,
p.get_height()), ha = 'center', va = 'center',
xytext = (0, -12), textcoords = 'offset points',fontsize=14,color='w')
plt.xticks(rotation=30)
plt.xlabel("Month")
plt.ylabel("Number Of Users")
plt.show()
plt.figure(figsize=(15,8))
sns.lineplot(x='session_week', y='uid', data=wau_total,linewidth = 3).set_title('Number Of Active Unique Users Each Week'
, fontdict={'size':15})
plt.xlabel("Week")
x_ticks = np.arange(0, 55, 4)
plt.xticks(x_ticks)
plt.ylabel("Number Of Users")
plt.grid()
plt.show()
plt.figure(figsize=(15,8))
sns.lineplot(x='session_date', y='uid', data=dau_total,linewidth = 2).set_title('Number Of Active Unique Users Each Day'
, fontdict={'size':15})
plt.xlabel("Date")
plt.ylabel("Number Of Users")
plt.grid()
plt.show()
These three graphs show essentially the same information at different resolutions.
The first graph shows unique users per month: on average we have more visitors close to the holiday season. The number of users drops after March, keeps falling until September, and then rises again to a peak in November.
The second graph shows unique users per week. It tells a similar story overall, but with more visible fluctuations. Those swings can be explained by holidays, advertising campaigns, or even paycheck timing.
The third graph shows the same data by day, at an even higher resolution. The story is similar, but now a clear zig-zag pattern emerges, which I would guess reflects changes in people's behaviour over the weekend. On top of that there are two very distinct extremes: a very high peak around Christmas time, and a very low dip at a seemingly random time. It would be worth looking into those few days in April.
daily_sessions = visits.groupby('session_date',as_index=False).agg({'uid':'count'})
avg_daily_ses = int(daily_sessions['uid'].mean()) # average number of sessions per day
print('Number of average daily sessions: {}'.format(avg_daily_ses))
daily_sessions.head()
plt.figure(figsize=(15,8))
sns.lineplot(x='session_date', y='uid', data=daily_sessions,linewidth = 2).set_title('Average Number Of Sessions Per Day'
, fontdict={'size':15})
plt.xlabel("Date")
plt.ylabel("Number Of Sessions")
plt.axhline(avg_daily_ses,color = 'red', linewidth= 1,label ='Overall Average')
plt.legend()
plt.grid()
plt.show()
This graph is almost identical to the 'Number Of Active Unique Users Each Day' graph from the last section. The number of sessions per day is very close to the number of active users per day.
Now I will plot both graphs together and see how close they are.
plt.figure(figsize=(15,6))
sns.lineplot(x='session_date', y='uid', data=daily_sessions,linewidth = 1).set_title('Average Number Of Sessions Vs Number Of Active Unique Users Per Day'
, fontdict={'size':15})
sns.lineplot(x='session_date', y='uid', data=dau_total,linewidth = 1)
plt.xlabel("Date")
plt.ylabel("Number Of Sessions / Users")
plt.grid()
plt.show()
Here we can really see that the two graphs are almost identical. One interesting thing to note is that at the high peak the number of sessions is elevated relative to the number of active users, which means some users had a high number of sessions that day.
# Adding a column with the avg number of sessions per user per day
ses_per_user = daily_sessions.copy() # copy so daily_sessions itself isn't modified
ses_per_user['avg_per_user'] = ses_per_user['uid']/dau_total['uid'] # rows align: both frames are grouped by session_date
ses_per_user
plt.figure(figsize=(15,8))
sns.lineplot(x='session_date', y='avg_per_user', data=ses_per_user,linewidth = 2).set_title('Average Number Of Sessions Per Day Per User'
, fontdict={'size':15})
plt.xlabel("Date")
plt.ylabel("Average Sessions Per User")
plt.grid()
plt.show()
This graph shows the average number of sessions per user per day. The high spike around Christmas time is still there, but when we look at the per-user average there are actually two spikes around that period.
Now I will make a graph showing the number of sessions per day for each device.
# making the pivot table
visits_by_device = visits.pivot_table(index= 'session_date', columns ='device', values='uid',aggfunc='count')
visits_by_device.columns = visits_by_device.columns.categories
visits_by_device.reset_index(inplace=True)
visits_by_device.head()
plt.figure(figsize=(10,6))
plt.plot(visits_by_device.session_date,visits_by_device.desktop, label='Desktop')
plt.plot(visits_by_device.session_date,visits_by_device.touch, label='Touch')
plt.xlabel("Date")
plt.ylabel("Number Of Visits")
plt.title('Daily Visits By Device')
plt.legend()
plt.grid()
plt.show()
In this graph we can see the daily visits from the two devices, desktop and touch. As we saw earlier, we have far more visits from desktop, but overall the behavior of both is quite similar.
session_time = visits[['start_ts','end_ts','session_date','uid','device']]
session_time['sesion_length']=session_time['end_ts']-session_time['start_ts']
session_time['sesion_length']=session_time['sesion_length'].dt.seconds
avg_session_time = session_time['sesion_length'].mean()
print('The average session is {} seconds long, or about {} minutes'.format(int(avg_session_time),int((avg_session_time)/60)))
session_time_group = session_time.groupby(['session_date'],as_index=False).agg({'sesion_length':'mean'})
plt.figure(figsize=(10,6))
plt.plot(session_time_group.session_date,session_time_group.sesion_length)
plt.xlabel("Date")
plt.ylabel("Seconds")
plt.title('Average Session Length')
plt.axhline(avg_session_time,color = 'red', linewidth= 1,label ='Overall Average')
plt.legend()
plt.grid()
plt.show()
Once again the graph shows that session length is consistent with all the other parameters we have looked at so far: the same spike around Christmas and the same low around April (someone should really look into what went on that April). All in all, this graph looks more even across the year than the others.
Next, I want to compare platforms and see whether the average session time differs between them.
session_time_desktop = session_time.query('device =="desktop"')
session_time_touch = session_time.query('device =="touch"')
average_session_time_desktop = session_time_desktop['sesion_length'].mean()
average_session_time_touch =session_time_touch['sesion_length'].mean()
average_session_time_desktop
average_session_time_touch
print('The average session on a desktop is: {:.0f} seconds\nThe average session on a touch device is: {:.0f} seconds'.format(average_session_time_desktop,average_session_time_touch))
We can see that people on desktop spend almost 40% more time on the site than people on touch devices. Is that good or bad? It seems good, but it's a bit too early to say for sure. Maybe the site on touch devices is simply more streamlined? These are interesting questions that need to be asked.
plt.figure(figsize=(15,6))
plt.hist(session_time.sesion_length,bins=11,range=(300,900),color='c',label='Data')
plt.axvline(session_time.sesion_length.mean(), color='k', linestyle='dashed',label='Average')
plt.axvline(session_time.sesion_length.median(), color='r', linestyle='dashed',label='Median')
plt.xlabel("Session Time (sec)")
plt.ylabel("Number Of Sessions")
plt.title('Session length')
plt.legend()
plt.show()
The histogram above shows the distribution of session times. The average session time is a bit more than 600 seconds (10 minutes). This looks like the tail end of an exponential distribution: most sessions are shorter than the average, with some very high outliers. From the short sessions that make up the majority, the counts fall off steadily as session time grows, until very long sessions become very rare.
To answer this I will find the retention by month.
visits_first_1 = visits.groupby(['uid'])['start_ts'].min()
visits_first_1.name = 'first_activity_date'
visits_big = visits
visits_big = visits_big.join(visits_first_1,on='uid')
visits_big['session_month']= visits_big['session_date'].astype('datetime64[M]')
visits_big['first_session_month'] = visits_big['first_activity_date'].astype('datetime64[M]')
visits_big['cohort_lifetime']= visits_big['session_month']- visits_big['first_session_month']
visits_big['cohort_lifetime'] = visits_big['cohort_lifetime']/ np.timedelta64(1,'M')
visits_big['cohort_lifetime'] = visits_big['cohort_lifetime'].round().astype(int)
cohorts = visits_big.groupby(['first_session_month','cohort_lifetime']).agg({'uid':'nunique'}).reset_index()
initial_users_count = cohorts[cohorts['cohort_lifetime']==0][['first_session_month','uid']]
initial_users_count = initial_users_count.rename(columns={'uid':'cohort_users'})
cohorts = cohorts.merge(initial_users_count,on='first_session_month')
cohorts['retention'] =cohorts['uid']/cohorts['cohort_users']
cohorts['just_date'] = cohorts['first_session_month'].dt.date
retention_pivot = cohorts.pivot_table(index='just_date',columns='cohort_lifetime', values='retention', aggfunc='sum')
retention_pivot
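The cohort logic above can be checked on toy numbers. In this sketch (months encoded as plain integers, all values made up) each user is assigned to the cohort of their first active month, and retention is the share of the cohort still active n months later:

```python
import pandas as pd

# Toy activity log: user id and the month of each visit (plain ints).
toy = pd.DataFrame({
    'uid':   [1, 1, 2, 2, 3, 3],
    'month': [6, 7, 6, 8, 7, 7],
})
toy['cohort'] = toy.groupby('uid')['month'].transform('min')  # first active month
toy['lifetime'] = toy['month'] - toy['cohort']                # months since first visit

# Unique users per (cohort, lifetime), divided by the cohort's initial size.
cohorts = toy.groupby(['cohort', 'lifetime'])['uid'].nunique().reset_index()
sizes = cohorts.loc[cohorts['lifetime'] == 0].set_index('cohort')['uid']
cohorts['retention'] = cohorts['uid'] / cohorts['cohort'].map(sizes)
print(cohorts['retention'].tolist())
```

Lifetime 0 always has retention 1.0 by construction, which matches the first column of the pivot above.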
sns.set(style='white')
plt.figure(figsize=(15, 8))
cmap = mcolors.LinearSegmentedColormap.from_list(name="n",colors=['azure','purple','blue','green','yellow','orange','red'])
sns.heatmap(retention_pivot, annot=True, fmt='.1%', linewidths=1, linecolor='gray',cmap=cmap)
plt.title('Cohorts: User Retention', fontdict={'size':15})
plt.xlabel("Cohort Lifetime (months)")
plt.ylabel("First Session")
plt.show()
We can see in the heatmap that the first cohort had a rise between the 4th and 5th months; besides that, all the cohorts decline steadily. We can also see that all the cohorts end up at around 2%-3% retention in their last month (excluding the first and last cohorts). This is not good, because over time it takes cohorts a shorter and shorter time to get that low: the 7.2017 cohort took 9 months to fall to 2%, while the 4.2018 cohort took only 2 months. Users seem to be churning faster and faster.
Next I will calculate the conversion rate.
# Grouping by user and first purchase date
orders_grouped = orders.groupby(['uid'])['buy_ts'].min()
orders_grouped.name = 'first_buy_date'
orders_grouped.head()
visits_first = visits_big[['uid','first_activity_date','cohort_lifetime','device','first_session_month','source_id']]
visits_first = visits_first.merge(orders_grouped, on='uid',how='outer')
visits_first['first_activity_date'] = visits_first['first_activity_date'].dt.date
visits_first['first_buy_date'] = visits_first['first_buy_date'].dt.date
visits_first['convertion'] = visits_first['first_buy_date']- visits_first['first_activity_date']
visits_first.drop_duplicates(subset='uid',inplace=True)
visits_first.reset_index(inplace=True, drop=True)
visits_first.head(10)
no_convertion = visits_first['convertion'].isna().sum() # users with no first purchase date
print('% of users that did not convert at all: {:.1%}'.format(no_convertion/len(visits_first)))
A very large share of users (84%) did not convert at all.
desktop_buy = visits_first.query('device == "desktop"')
touch_buy = visits_first.query('device == "touch"')
desktop_no_convertion = desktop_buy['convertion'].isna().sum()
touch_no_convertion = touch_buy['convertion'].isna().sum()
print('% of users on a desktop that did not convert at all: {:.1%}'.format(desktop_no_convertion/len(desktop_buy)))
print('% of users on a touch device that did not convert at all: {:.1%}'.format(touch_no_convertion/len(touch_buy)))
We can see that users who came to our site on a touch device converted less than users on a desktop. The difference is significant: around 8 percentage points.
convertions= visits_first.dropna(subset=['convertion']).reset_index(drop=True)
convertions['convertion']=convertions['convertion'].dt.days.astype('int16')
convertions.head()
convertion_0 = convertions.query('convertion == 0')
convertion_week = convertions.query('convertion > 0 & convertion < 8')
convertion_2weeks = convertions.query('convertion >= 8 & convertion < 15')
convertion_month = convertions.query('convertion >= 15 & convertion <= 30')
convertion_over_month =convertions.query('convertion > 30')
print('% of users that converted on first day: {}'.format(len(convertion_0)/len(convertions)))
print('% of users that converted on first week: {}'.format(len(convertion_week)/len(convertions)))
print('% of users that converted on second week: {}'.format(len(convertion_2weeks)/len(convertions)))
print('% of users that converted on the third or fourth week: {}'.format(len(convertion_month)/len(convertions)))
print('% of users that converted after the first month: {}'.format(len(convertion_over_month)/len(convertions)))
When a user converts, it usually happens on the first day (68%); 11% of converted users do so during their first week (excluding the first day), so close to 80% of conversions happen within the first week. A user who did not convert in the first week has only about a 5% chance of converting within the first month, but there is still hope, because 13.5% of converted users do so after the first month.
Next I will look at each cohort and find the % of converted users that converted within the first week.
convertion_1st_week = convertions.query('convertion < 8')
convertion_cohorts_1st_week = convertion_1st_week.groupby(['first_session_month']).agg({'uid':'nunique'}).reset_index()
convertion_cohorts_1st_week.rename(columns={'uid':'convertions'},inplace=True)
convertion_cohorts = convertions.groupby(['first_session_month']).agg({'uid':'nunique'}).reset_index()
convertion_cohorts= convertion_cohorts.merge(convertion_cohorts_1st_week,on='first_session_month')
convertion_cohorts['first_session_month']= convertion_cohorts['first_session_month'].astype(str)
convertion_cohorts['% convertion']= convertion_cohorts['convertions']/convertion_cohorts['uid']
convertion_cohorts
plt.figure(figsize=(15,8))
ax= sns.barplot(x='first_session_month', y='% convertion', data=convertion_cohorts)
ax.set_title('First Week Convertion By Cohort (converted only)', fontdict={'size':15})
for p in ax.patches:
ax.annotate(format(p.get_height(), '.2f'),
(p.get_x() + p.get_width() / 2.,
p.get_height()), ha = 'center', va = 'center',
xytext = (0, -12), textcoords = 'offset points',fontsize=14,color='w')
plt.xticks(rotation=30)
plt.xlabel("Cohort Month")
plt.ylabel("% Converted Within First Week")
plt.show()
The graph above shows that the within-first-week conversion rate rises with each new cohort. Note that this is calculated only over users who converted.
Next I will do the same, but this time compute the % of users that converted within the first week out of the total number of users in the cohort (not only the converted users).
all_visits_cohorts = visits_first.groupby('first_session_month').agg({'uid':'nunique'}).reset_index()
all_visits_cohorts = all_visits_cohorts.merge(convertion_cohorts_1st_week, on='first_session_month')
convertions_total = visits_first[visits_first['convertion'].notna()]
convertions_total_cohorts = convertions_total.groupby('first_session_month').agg({'uid':'nunique'}).reset_index()
convertions_total_cohorts.rename(columns={'uid':'total_convertions'},inplace=True)
all_visits_cohorts = all_visits_cohorts.merge(convertions_total_cohorts, on='first_session_month')
all_visits_cohorts['first_session_month'] = all_visits_cohorts['first_session_month'].astype(str)
all_visits_cohorts['% convertions'] =all_visits_cohorts['convertions']/all_visits_cohorts['uid']
all_visits_cohorts['% total_convertions'] =all_visits_cohorts['total_convertions']/all_visits_cohorts['uid']
plt.figure(figsize=(15,10))
ax = sns.barplot(x='first_session_month', y='% total_convertions', data=all_visits_cohorts, color = "red")
ax.set_title('Convertion By Cohort (all users)', fontdict={'size':15})
bottom_plot = sns.barplot(x = 'first_session_month', y = '% convertions',data=all_visits_cohorts, color = "#0000A3")
topbar = plt.Rectangle((0,0),1,1,fc="red", edgecolor = 'none')
bottombar = plt.Rectangle((0,0),1,1,fc='#0000A3', edgecolor = 'none')
l = plt.legend([bottombar, topbar], ['First Week Convertions', 'Total Convertions'], loc=1, ncol = 2, prop={'size':16})
plt.xticks(rotation=30)
plt.xlabel("Cohort Month")
plt.ylabel("% of convertions")
plt.show()
This graph shows the total % of conversions for each cohort and the % of conversions within the first week. The first cohort leads in total conversions, while the first and last cohorts lead in first-week conversions. Total conversions trend downward from the first cohort to the second-to-last, but first-week conversions fluctuate while remaining roughly stable around 13% (0.13).
def convertion_time(row):
convertion_time=row['convertion']
if convertion_time == 0:
return '00d'
elif convertion_time < 8:
return '07d'
elif convertion_time < 15:
return '14d'
elif convertion_time < 31:
return '30d'
else:
return '31+d'
convertions['convertion_days']=convertions.apply(convertion_time,axis=1)
convertions.head()
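As a side note, the same bucketing could be done with pd.cut instead of a row-wise apply; this sketch on toy conversion-day values (made up) produces the same labels as the function above:

```python
import pandas as pd

days = pd.Series([0, 3, 10, 25, 45])  # toy conversion times in days
# Bin edges mirror the if/elif chain: {0}, 1-7, 8-14, 15-30, 31+.
labels = pd.cut(days,
                bins=[-1, 0, 7, 14, 30, float('inf')],
                labels=['00d', '07d', '14d', '30d', '31+d'])
print(labels.tolist())
```

pd.cut is vectorized, so it avoids the per-row function call and keeps the bin edges in one place.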
convertions_groupd = convertions.groupby(['first_session_month','convertion_days']).agg({'uid':'nunique'}).reset_index()
convertions_groupd = convertions_groupd.merge(initial_users_count,on='first_session_month')
convertions_groupd['%convertion'] = convertions_groupd['uid']/convertions_groupd['cohort_users']
convertions_groupd['just_date'] = convertions_groupd['first_session_month'].dt.date
convertions_groupd.head()
convertions_pivot = convertions_groupd.pivot_table(index='just_date',columns='convertion_days',values='%convertion',aggfunc='sum')
convertions_pivot = convertions_pivot.sort_index(axis='columns')
convertions_pivot
sns.set(style='white')
plt.figure(figsize=(13, 9))
sns.heatmap(convertions_pivot, annot=True, fmt='.1%', linewidths=1, linecolor='gray')
plt.title('User Convertion %', fontdict={'size':15})
plt.xlabel("Convertion Days")
plt.ylabel("First Session")
plt.show()
This heatmap shows the conversion rate for each of the cohorts. The first cohort is the strongest overall, and it's clear that most conversions happen on the first day. All the cohorts behave similarly: almost no conversions in the rest of the first month if they didn't happen in the first week (as we saw before). The last column (31+ days) isn't directly comparable, because it has no time limit, so the first cohort will naturally accumulate more conversions there than the second-to-last. What we can say from that column is that users keep converting even after the first month; those late converters can add up to a substantial amount.
orders['order_date']= orders['buy_ts'].dt.date
orders['order_week']= orders['buy_ts'].dt.week
orders['order_month']= orders['buy_ts'].dt.month
avg_daily_purchases = int(orders.groupby('order_date')['uid'].count().mean())
print('The average number of daily purchases is: {}'.format(avg_daily_purchases))
orders_source = pd.merge(orders,visits_first[['source_id','first_activity_date','cohort_lifetime','uid']],on='uid')
orders_source.head()
orders_source_grouped = orders_source.groupby(['source_id','order_month']).agg({'uid':'count'}).reset_index()
orders_source_grouped.head()
fig=px.line(orders_source_grouped, x='order_month', y='uid', color='source_id',
labels={'order_month':'Month','uid':'Purchases'},
title="Average Number of Purchases Per Month Per Ad Source")
fig.update_layout(
autosize=False,
width=1000,
height=700)
fig.show()
The graph above shows the number of purchases each month for each of the ad sources. Ad source number 4 is almost constantly in the lead, except in the 8th month, when ad source number 5 took over.
We can see that ad sources 7 and 6 are totally useless; we can get rid of them. Sources 9 and 10 likewise bring almost no purchases at all.
In general there is a drop in the middle of the year, after which purchases rise and almost triple towards the holiday season at the end of the year.
avg_purchas= orders['revenue'].mean()
print('Average of all purchases is: {:.2f}'.format(avg_purchas))
Now I'll calculate the average purchase size for each day.
orders_revenue=orders_source.groupby('order_date',as_index=False).agg({'revenue':'mean'})
orders_revenue.head()
plt.figure(figsize=(15,8))
sns.lineplot(data=orders_revenue, x='order_date', y='revenue', linewidth=2, alpha=0.8)
plt.grid()
plt.title('Average Purchases Amount',fontdict={'size':20})
plt.xlabel("Date")
plt.ylabel("Purchase Amount")
plt.axhline(avg_purchas,color = 'red', linewidth= 2,label ='Overall Average',alpha=0.5)
plt.legend(frameon=False,fontsize=15)
plt.show()
It's no surprise that there is a huge jump in average purchase size in the days before Christmas. The graph has other peaks that probably correspond to holidays, though none compares with Christmas.
Now, as before, let's look at the average purchase amount per month per ad source.
order_source_revenue = orders_source.groupby(['source_id','order_month']).agg({'revenue':'mean'}).reset_index()
order_source_revenue.head()
fig=px.line(order_source_revenue, x='order_month', y='revenue', color='source_id',
labels={'order_month':'Month','revenue':'Purchases Sum'},
title="Average Sum Of Purchases Per Month Per Ad Source")
fig.update_layout(
autosize=False,
width=1000,
height=700)
fig.show()
This is a very interesting graph when compared with 'Average Number of Purchases Per Month Per Ad Source'. Here we see that most of the spike around Christmas time comes from a single source (number 2). And while source number 4 led in the number of purchases, here it sits in 4th place for most of the year when it comes to purchase amounts: it may bring many buying customers, but they buy in smaller amounts relative to other sources. So at the top we have source number 2, followed by source number 1.
Sources 9 and 10, which didn't bring many purchases, show more promise in purchase amounts, but are still at the lower end.
orders['order_month_datetime'] = orders['order_date'].astype('datetime64[M]')
orders_first = orders.groupby('uid').agg({'order_month_datetime':'min'}).reset_index()
orders_first.columns = ('uid','first_order_month')
orders_cohort_size = orders_first.groupby('first_order_month').agg({'uid':'nunique'}).reset_index()
orders_cohort_size.columns = ('first_order_month','n_buyers')
orders_= pd.merge(orders,orders_first, on='uid')
cohorts_orders = orders_.groupby(['first_order_month','order_month_datetime']).agg({'revenue':'sum'}).reset_index()
cohorts_orders.sample(10)
report = pd.merge(orders_cohort_size, cohorts_orders, on='first_order_month')
report.head()
margin_rate =.5
report['gp']= report['revenue']*margin_rate
report['age']= (report['order_month_datetime']-report['first_order_month'])/np.timedelta64(1, 'M')
report['age'] = report['age'].round().astype('int')
report['ltv'] = report['gp'] / report['n_buyers']
report['just_date'] = report['first_order_month'].dt.date
report.head(20)
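The LTV formula used in 'report' is simple enough to verify on toy numbers (all values below are made up): gross profit is revenue times the margin rate, and LTV is gross profit per buyer in the cohort.

```python
import pandas as pd

margin_rate = 0.5  # same assumed margin as in the report
toy = pd.DataFrame({'n_buyers': [10, 10], 'revenue': [200.0, 60.0]})
toy['gp'] = toy['revenue'] * margin_rate   # gross profit per cohort-month
toy['ltv'] = toy['gp'] / toy['n_buyers']   # profit earned per buyer
print(toy['ltv'].tolist())
```

Summing a cohort's LTV across its months gives the total gross profit per buyer, which is what the heatmap below visualizes month by month.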
output = report.pivot_table(
index='just_date',
columns='age',
values='ltv',
aggfunc='mean')
output.fillna('')
sns.set(style='white')
plt.figure(figsize=(13, 9))
sns.heatmap(output, annot=True, linewidths=1, linecolor='gray',cmap='RdBu_r')
plt.title('Cohorts: LTV', fontdict={'size':15})
plt.xlabel("Cohort Age (months)")
plt.ylabel("First Session")
plt.show()
We can see that the LTV for each cohort is high in the first month and then drops drastically. I believe most users tend to buy in their first days after registering and fewer come back later, as we saw earlier in this project.
overall_cost = costs['costs'].sum()
print('Overall costs are: {}'.format(overall_cost))
I will check costs per ad source.
costs_grouped_source = costs.groupby('source_id').agg({'costs':'sum'}).reset_index()
fig,ax=plt.subplots(figsize=(6,4))
ax.vlines(x=costs_grouped_source.source_id, ymin=0,ymax=costs_grouped_source.costs, color='red',
alpha=0.7, linewidth=10)
ax.set_title('Costs By Ad source',fontdict={'size':15})
ax.set_xlabel('Ad Source')
ax.set_ylabel('Cost')
plt.grid()
plt.show()
This graph shows the costs per ad source. Source number 3 takes almost half of the whole budget. Sources 6, 7, and 8 are missing, so we are not paying for them; that explains why so few of our users come from them. It also explains why sources 9 and 10 bring few users: they get a small fraction of what the other sources cost.
Now I will check the costs per month.
costs['month']=costs['dt'].dt.month
costs_grouped_month = costs.groupby('month').agg({'costs':'sum'}).reset_index()
fig,ax=plt.subplots(figsize=(6,4))
ax.vlines(x=costs_grouped_month.month, ymin=0,ymax=costs_grouped_month.costs, color='red',
alpha=0.7, linewidth=10)
ax.set_title('Costs By Month',fontdict={'size':15})
ax.set_xlabel('Month')
ax.set_ylabel('Cost')
plt.grid()
plt.show()
There are two big pushes a year: one in April and the other just before Christmas, and there are three months with almost no ad spend: February, July, and September. Although there is a push (a high ad cost) around April, the earlier graphs show the opposite effect: sales drop after April.
Now I will calculate the CAC for each of the sources.
last_visits = visits.sort_values(by='start_ts', ascending=True
).groupby('uid', as_index=False).agg({'source_id':'last'})
last_visits_grouped = last_visits.groupby('source_id').agg({'uid':'count'}).reset_index()
cost_grouped= pd.merge(last_visits_grouped,costs_grouped_source,on='source_id',)
cost_grouped['cac']= cost_grouped['costs']/ cost_grouped['uid']
cost_grouped
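The CAC computation above is just spend divided by acquired users; a toy check with made-up numbers:

```python
import pandas as pd

toy = pd.DataFrame({'source_id': [1, 2],
                    'uid': [100, 40],          # users attributed to the source
                    'costs': [250.0, 300.0]})  # total ad spend per source
toy['cac'] = toy['costs'] / toy['uid']         # cost to acquire one user
print(toy['cac'].tolist())
```

Note that this attributes each user to a single source (here, their last-seen source, as in the real calculation above), which is one of several possible attribution models.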
fig,ax=plt.subplots(figsize=(6,4))
ax.vlines(x=cost_grouped.source_id, ymin=0,ymax=cost_grouped.cac, color='red',
alpha=0.7, linewidth=10)
ax.set_title('Cac By Ad Source',fontdict={'size':15})
ax.set_xlabel('Ad Source')
ax.set_ylabel('Cac')
plt.grid()
plt.show()
This graph shows the Customer Acquisition Cost per ad source; the sources we did not pay for are not in the graph. The most cost-effective were sources 10, 4, and 9, and the priciest were 2 and 3. Source 2 was among the priciest, but we saw in an earlier graph that it brought us the most revenue, so it was worthwhile. On the other hand, source 10 didn't bring us as much revenue, so this ad source should be reconsidered.
I'll start by calculating ROI for each cohort. I have the 'report' df and will build on it.
costs['dt'].describe()
orders['buy_ts'].describe()
costs['cost_month']= costs['dt'].astype('datetime64[M]')
I can see that the costs df is one month shorter than the orders df, so we won't have cost data for the last month of order data.
visits_ad_source = visits_big.groupby('uid').agg({'uid':'last','source_id':'last'})
visits_ad_source.columns =['uid','uid_ad_source'] # 'uid_ad_source' holds each user's last ad source
visits_ad_source.head()
visits_ad_source.reset_index(drop=True,inplace=True)
visits_ad_source.head()
orders_ad_source = pd.merge(visits_ad_source,orders,on='uid')
orders_ad_source.head()
ad_cohort_size = orders_ad_source.groupby('uid_ad_source').agg({'uid':'nunique'}).reset_index()
ad_cohort_size.columns = ['uid_ad_source','n_users']
ad_cohort_size
ad_cohorts = orders_ad_source.groupby(['uid_ad_source','order_month_datetime']).agg({'revenue':'sum'}).reset_index()
report_ad = pd.merge(ad_cohorts,ad_cohort_size,on='uid_ad_source')
report_ad.head()
report_ad['gp']= report_ad['revenue']*margin_rate
report_ad['ltv'] = report_ad['gp'] / report_ad['n_users']
report_ad.rename(columns={'uid_ad_source':'source_id'},inplace=True)
report_ad.head()
monthly_ad_costs = costs.groupby(['source_id','cost_month']).sum().reset_index()
monthly_ad_costs.rename(columns={'cost_month':'order_month_datetime'},inplace=True)
monthly_ad_costs.head()
report_ad_ = pd.merge(report_ad,monthly_ad_costs,on=['source_id','order_month_datetime'],how='outer')
report_ad_.head()
report_ad_['cac']= report_ad_['costs']/report_ad_['n_users']
report_ad_['romi']= report_ad_['ltv']/report_ad_['cac']
report_ad_['just_date'] = report_ad_['order_month_datetime'].dt.date
report_ad_.head()
output_romi = report_ad_.pivot_table(index='source_id',columns='just_date',values='romi',aggfunc='mean')
output_romi_cumsum = output_romi.cumsum(axis=1).round(2)
output_romi_cumsum
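Cumulative ROMI works the same way on toy numbers (made up): monthly ROMI is LTV/CAC, and a source has paid off once the cumulative sum crosses 1.

```python
import pandas as pd

toy = pd.DataFrame({'ltv': [0.8, 0.5, 0.4],    # gross profit per user, by month
                    'cac': [1.6, 1.6, 1.6]})   # acquisition cost per user
toy['romi'] = toy['ltv'] / toy['cac']          # return on marketing investment
cumulative = toy['romi'].cumsum().round(2)     # crosses 1 -> source has paid off
print(cumulative.tolist())
```

In this toy example the source pays off in its third month, which is the same crossing-1 reading used on the heatmap above.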
sns.set(style='white')
plt.figure(figsize=(15, 9))
sns.heatmap(output_romi_cumsum, annot=True, linewidths=1, linecolor='gray',cmap='RdBu_r')
plt.title('Ad source: Romi', fontdict={'size':15})
plt.xlabel("Month")
plt.ylabel("Ad Source")
plt.xticks(rotation=40)
plt.show()
output_romi_cumsum.head()
We can see in the table that by the 4th month most of the ad sources had started to pay off. Source number 3 is the last to pay off, and it did so only in the 6th month. The fastest and most profitable according to this table is source number 1, which became profitable in the second month and, at the last data point we have, brings 12 times more profit than cost.
In this project we looked at data on visits, orders, and ad costs for Yandex.Afisha.
We saw that the monthly, weekly, and daily numbers of users and sessions change mainly due to seasonal effects (holidays etc.). Overall, most users log in from desktop devices, far more than from touch devices, and the desktop session length is almost twice as long.
When we looked at conversion, we found that most users who convert do so on the first day (almost 69%); otherwise they convert either within the first week (11%) or only after the first month (13%), and they are unlikely to convert later in the first month if it didn't happen within the first week. By cohort, the first-week conversion rate rose with each cohort and peaked in the last one, the 2018-05 cohort.
When it comes to marketing, first things first: something has to change with source number 3. It is the most expensive, it was the last to become profitable in the ROI analysis, and the users it brings are not the best spenders. I would recommend cutting spending on this ad source substantially. Second, sources 1 and 2 should get more of the overall marketing budget: they are the fastest to become profitable and they bring users who, on average, spend more than those from other sources, so raising their budget should increase the company's profits. Source number 9 is another one I would recommend giving more budget to: it is currently one of the cheapest sources, and accordingly it is not high in the number of users it brings, nor do those users spend a lot, but its ROI is the third highest.
One more thing I would look into is the abnormal dip in the number of sessions and active users around the beginning of April. The user-activity graphs show a really hard dip there; perhaps a technical issue on the site stopped users from logging in. This brings us to the monthly ad budget. We see a rise in costs around the end of the year that corresponds to a very strong rise in total sales. For some reason there is also a sharp rise in ad costs in April, and that one does not correspond to any rise in sales; in fact, we see a general fall in sales around that time. So I would recommend lowering the ad spend in April, given that it brings little benefit despite being one of the most expensive months.